Machine Learning Analysis Report

Generated on August 03, 2025 at 09:19 PM

Machine Learning Analysis Pipeline

EDR: Dataset Loading & Preprocessing

EDR – Train/Test Overview
• Train shape: (185442, 20) | Test shape: (16287, 20)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 16
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.4643% of the data
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9955
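
The preparation steps listed above (infinite values handled, missing values filled with train medians, StandardScaler fit on train and applied to test) can be sketched as follows. This is an illustrative reconstruction, not the pipeline's actual code; the function and variable names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(train_df, test_df, target="label"):
    """Impute and scale features; every statistic is fit on the train split only."""
    X_train = train_df.drop(columns=[target]).replace([np.inf, -np.inf], np.nan)
    X_test = test_df.drop(columns=[target]).replace([np.inf, -np.inf], np.nan)

    medians = X_train.median()        # train medians -> no leakage from test
    X_train = X_train.fillna(medians)
    X_test = X_test.fillna(medians)   # test gaps also filled with *train* medians

    scaler = StandardScaler().fit(X_train)
    return (scaler.transform(X_train), scaler.transform(X_test),
            train_df[target].to_numpy(), test_df[target].to_numpy())
```

Fitting the medians and the scaler on the train split alone is what keeps the test set untouched during preprocessing.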

EDR: Model Performance Comparison

EDR – Model Performance Metrics

Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC
Logistic Regression | 0.9410 | 0.6610 | 0.0297 | 0.3784 | 0.0551 | 0.6427 | 0.0262
Random Forest (SMOTE) | 0.8831 | 0.5983 | 0.0123 | 0.3108 | 0.0236 | 0.7979 | 0.0263
LightGBM | 0.7034 | 0.6223 | 0.0083 | 0.5405 | 0.0163 | 0.6637 | 0.0071
Balanced RF | 0.8914 | 0.6831 | 0.0198 | 0.4730 | 0.0381 | 0.8597 | 0.0581
SGD SVM | 0.9322 | 0.6296 | 0.0222 | 0.3243 | 0.0416 | n/a | n/a
IsolationForest | 0.9916 | 0.5183 | 0.0441 | 0.0405 | 0.0423 | n/a | n/a
(n/a: curve-based AUCs were not computed for these models, presumably because they do not expose calibrated probability scores.)
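
As a small hedged illustration of how the table's threshold-free metrics are obtained, the toy arrays below stand in for the real test labels and model scores:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             roc_auc_score)

# Toy stand-ins for the real test labels and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.05, 0.30, 0.40, 0.10, 0.20, 0.80, 0.35])
y_pred = (y_score >= 0.5).astype(int)

roc_auc = roc_auc_score(y_true, y_score)           # rank-based, threshold-free
pr_auc = average_precision_score(y_true, y_score)  # area under the PR curve
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print(f"ROC-AUC={roc_auc:.4f}  PR-AUC={pr_auc:.4f}  BalancedAcc={bal_acc:.4f}")
```

ROC-AUC and PR-AUC score the ranking of positives over negatives, which is why they remain informative at 0.46% prevalence while raw accuracy does not.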

Confusion Matrix Analysis

Model | TN | FP | FN | TP | FP Rate | Miss Rate
Logistic Regression | 15298 | 915 | 46 | 28 | 5.64% | 62.16%
Random Forest (SMOTE) | 14360 | 1853 | 51 | 23 | 11.43% | 68.92%
LightGBM | 11416 | 4797 | 34 | 40 | 29.59% | 45.95%
Balanced RF | 14483 | 1730 | 39 | 35 | 10.67% | 52.70%
SGD SVM | 15158 | 1055 | 50 | 24 | 6.51% | 67.57%
IsolationForest | 16148 | 65 | 71 | 3 | 0.40% | 95.95%
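
The derived rates follow directly from the confusion-matrix cells; as a worked check against the EDR Logistic Regression row (TN=15298, FP=915, FN=46, TP=28):

```python
def fp_rate(tn: int, fp: int) -> float:
    """False-positive rate: share of true negatives wrongly flagged positive."""
    return fp / (fp + tn)

def miss_rate(fn: int, tp: int) -> float:
    """Miss rate (false-negative rate): share of true positives missed."""
    return fn / (fn + tp)

# EDR Logistic Regression row from the table above.
tn, fp, fn, tp = 15298, 915, 46, 28
print(f"FP rate: {fp_rate(tn, fp):.2%}")      # 5.64%
print(f"Miss rate: {miss_rate(fn, tp):.2%}")  # 62.16%
```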

Best Models by Metric

Metric | Best Model | Value
Accuracy | IsolationForest | 0.9916
Balanced Acc | Balanced RF | 0.6831
Precision | IsolationForest | 0.0441
Recall | LightGBM | 0.5405
F1 | Logistic Regression | 0.0551
ROC-AUC | Balanced RF | 0.8597
PR-AUC | Balanced RF | 0.0581
Lowest False Positive Rate | IsolationForest | 0.40%
Lowest Miss Rate | LightGBM | 45.95%

EDR – Metrics by Model

EDR – ROC Curves

EDR – Precision–Recall Curves

EDR – Predicted Probability Distributions

EDR – Threshold Sweep

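A threshold sweep of the kind plotted above can be sketched with plain NumPy; the score array below is an illustrative stand-in for a model's predicted probabilities:

```python
import numpy as np

def sweep_thresholds(y_true, y_score, thresholds):
    """Precision, recall and F1 at each candidate decision threshold."""
    rows = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((t, precision, recall, f1))
    return rows

# Illustrative scores; in the report these would come from predict_proba.
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.5, 0.9])
for t, p, r, f in sweep_thresholds(y_true, y_score, [0.3, 0.5, 0.7]):
    print(f"t={t:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

At 0.46% prevalence the default 0.5 threshold is rarely optimal, which is why sweeping it and picking the F1- or cost-optimal point matters here.
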
EDR: Logistic Regression – Detailed Analysis

EDR – Logistic Regression: Confusion Matrix

EDR – Logistic Regression: Classification Report

Class | precision | recall | f1 | support
0 | 0.9970 | 0.9436 | 0.9695 | 16213
1 | 0.0297 | 0.3784 | 0.0551 | 74
accuracy |  |  | 0.9410 | 16287
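
Per-class tables like this one are conventionally produced by scikit-learn's classification_report; a minimal sketch with placeholder labels and predictions:

```python
from sklearn.metrics import classification_report

# Placeholder labels/predictions standing in for the real test split.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

print(classification_report(y_true, y_pred, digits=4, zero_division=0))

# The same numbers as a dict, convenient for building tables like the one above.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
```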

EDR – Logistic Regression: Feature Importance

EDR: Random Forest (SMOTE) – Detailed Analysis

EDR – Random Forest (SMOTE): Confusion Matrix

EDR – Random Forest (SMOTE): Classification Report

Class | precision | recall | f1 | support
0 | 0.9965 | 0.8857 | 0.9378 | 16213
1 | 0.0123 | 0.3108 | 0.0236 | 74
accuracy |  |  | 0.8831 | 16287

EDR – Random Forest (SMOTE): Feature Importance

EDR: LightGBM – Detailed Analysis

EDR – LightGBM: Confusion Matrix

EDR – LightGBM: Classification Report

Class | precision | recall | f1 | support
0 | 0.9970 | 0.7041 | 0.8254 | 16213
1 | 0.0083 | 0.5405 | 0.0163 | 74
accuracy |  |  | 0.7034 | 16287

EDR – LightGBM: Feature Importance

EDR: Balanced RF – Detailed Analysis

EDR – Balanced RF: Confusion Matrix

EDR – Balanced RF: Classification Report

Class | precision | recall | f1 | support
0 | 0.9973 | 0.8933 | 0.9424 | 16213
1 | 0.0198 | 0.4730 | 0.0381 | 74
accuracy |  |  | 0.8914 | 16287

EDR – Balanced RF: Feature Importance

EDR: SGD SVM – Detailed Analysis

EDR – SGD SVM: Confusion Matrix

EDR – SGD SVM: Classification Report

Class | precision | recall | f1 | support
0 | 0.9967 | 0.9349 | 0.9648 | 16213
1 | 0.0222 | 0.3243 | 0.0416 | 74
accuracy |  |  | 0.9322 | 16287

EDR – SGD SVM: Feature Importance

EDR: IsolationForest – Detailed Analysis

EDR – IsolationForest: Confusion Matrix

EDR – IsolationForest: Classification Report

Class | precision | recall | f1 | support
0 | 0.9956 | 0.9960 | 0.9958 | 16213
1 | 0.0441 | 0.0405 | 0.0423 | 74
accuracy |  |  | 0.9916 | 16287

EDR – IsolationForest: Feature Importance

Feature importance not available for this model type.
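
IsolationForest has no per-class impurity importances of the kind this report extracts from the supervised models. One common workaround (an assumption about the workflow, not part of this pipeline) is permutation importance against the known labels, scoring the anomaly score with ROC-AUC:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score

def anomaly_auc(model, X, y):
    """score_samples is lower for anomalies, so negate it before computing AUC."""
    return roc_auc_score(y, -model.score_samples(X))

# Synthetic data: 10 anomalies shifted away from the bulk in every feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.zeros(500, dtype=int)
y[:10] = 1
X[:10] += 4.0

model = IsolationForest(random_state=0).fit(X)
result = permutation_importance(model, X, y, scoring=anomaly_auc,
                                n_repeats=5, random_state=0)
print(result.importances_mean)  # one mean AUC drop per feature
```

Features whose permutation degrades the anomaly-ranking AUC the most are, under this proxy, the ones the detector relies on.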

XDR: Dataset Loading & Preprocessing

XDR – Train/Test Overview
• Train shape: (185442, 34) | Test shape: (16287, 34)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 30
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.4643% of the data
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9955

XDR: Model Performance Comparison

XDR – Model Performance Metrics

Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC
Logistic Regression | 0.9406 | 0.6002 | 0.0204 | 0.2568 | 0.0378 | 0.6567 | 0.0233
Random Forest (SMOTE) | 0.9137 | 0.5800 | 0.0132 | 0.2432 | 0.0250 | 0.8037 | 0.0499
LightGBM | 0.8390 | 0.6299 | 0.0119 | 0.4189 | 0.0231 | 0.7564 | 0.0111
Balanced RF | 0.8951 | 0.6984 | 0.0217 | 0.5000 | 0.0415 | 0.8584 | 0.0671
SGD SVM | 0.8709 | 0.6123 | 0.0125 | 0.3514 | 0.0241 | n/a | n/a
IsolationForest | 0.9944 | 0.5129 | 0.0909 | 0.0270 | 0.0417 | n/a | n/a

Confusion Matrix Analysis

Model | TN | FP | FN | TP | FP Rate | Miss Rate
Logistic Regression | 15300 | 913 | 55 | 19 | 5.63% | 74.32%
Random Forest (SMOTE) | 14864 | 1349 | 56 | 18 | 8.32% | 75.68%
LightGBM | 13633 | 2580 | 43 | 31 | 15.91% | 58.11%
Balanced RF | 14541 | 1672 | 37 | 37 | 10.31% | 50.00%
SGD SVM | 14158 | 2055 | 48 | 26 | 12.68% | 64.86%
IsolationForest | 16193 | 20 | 72 | 2 | 0.12% | 97.30%

Best Models by Metric

Metric | Best Model | Value
Accuracy | IsolationForest | 0.9944
Balanced Acc | Balanced RF | 0.6984
Precision | IsolationForest | 0.0909
Recall | Balanced RF | 0.5000
F1 | IsolationForest | 0.0417
ROC-AUC | Balanced RF | 0.8584
PR-AUC | Balanced RF | 0.0671
Lowest False Positive Rate | IsolationForest | 0.12%
Lowest Miss Rate | Balanced RF | 50.00%
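
Summaries like this can be derived mechanically from the metrics table; a sketch with pandas, using a subset of the XDR numbers above:

```python
import pandas as pd

# Two XDR metrics from the table above (None where no score is available).
df = pd.DataFrame(
    {
        "Model": ["Logistic Regression", "Random Forest (SMOTE)", "LightGBM",
                  "Balanced RF", "SGD SVM", "IsolationForest"],
        "Balanced Acc": [0.6002, 0.5800, 0.6299, 0.6984, 0.6123, 0.5129],
        "PR-AUC": [0.0233, 0.0499, 0.0111, 0.0671, None, None],
    }
).set_index("Model")

# idxmax skips NaN, so models without a score never win a metric by default.
best = {metric: df[metric].idxmax() for metric in df.columns}
print(best)  # both metrics point at Balanced RF here
```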

XDR – Metrics by Model

XDR – ROC Curves

XDR – Precision–Recall Curves

XDR – Predicted Probability Distributions

XDR – Threshold Sweep

XDR: Logistic Regression – Detailed Analysis

XDR – Logistic Regression: Confusion Matrix

XDR – Logistic Regression: Classification Report

Class | precision | recall | f1 | support
0 | 0.9964 | 0.9437 | 0.9693 | 16213
1 | 0.0204 | 0.2568 | 0.0378 | 74
accuracy |  |  | 0.9406 | 16287

XDR – Logistic Regression: Feature Importance

XDR: Random Forest (SMOTE) – Detailed Analysis

XDR – Random Forest (SMOTE): Confusion Matrix

XDR – Random Forest (SMOTE): Classification Report

Class | precision | recall | f1 | support
0 | 0.9962 | 0.9168 | 0.9549 | 16213
1 | 0.0132 | 0.2432 | 0.0250 | 74
accuracy |  |  | 0.9137 | 16287

XDR – Random Forest (SMOTE): Feature Importance

XDR: LightGBM – Detailed Analysis

XDR – LightGBM: Confusion Matrix

XDR – LightGBM: Classification Report

Class | precision | recall | f1 | support
0 | 0.9969 | 0.8409 | 0.9122 | 16213
1 | 0.0119 | 0.4189 | 0.0231 | 74
accuracy |  |  | 0.8390 | 16287

XDR – LightGBM: Feature Importance

XDR: Balanced RF – Detailed Analysis

XDR – Balanced RF: Confusion Matrix

XDR – Balanced RF: Classification Report

Class | precision | recall | f1 | support
0 | 0.9975 | 0.8969 | 0.9445 | 16213
1 | 0.0217 | 0.5000 | 0.0415 | 74
accuracy |  |  | 0.8951 | 16287

XDR – Balanced RF: Feature Importance

XDR: SGD SVM – Detailed Analysis

XDR – SGD SVM: Confusion Matrix

XDR – SGD SVM: Classification Report

Class | precision | recall | f1 | support
0 | 0.9966 | 0.8732 | 0.9309 | 16213
1 | 0.0125 | 0.3514 | 0.0241 | 74
accuracy |  |  | 0.8709 | 16287

XDR – SGD SVM: Feature Importance

XDR: IsolationForest – Detailed Analysis

XDR – IsolationForest: Confusion Matrix

XDR – IsolationForest: Classification Report

Class | precision | recall | f1 | support
0 | 0.9956 | 0.9988 | 0.9972 | 16213
1 | 0.0909 | 0.0270 | 0.0417 | 74
accuracy |  |  | 0.9944 | 16287

XDR – IsolationForest: Feature Importance

Feature importance not available for this model type.